The project aims to predict the number of wins (W) for an MLB team in 2015 based on 16 features from 2014 data. The features include R, AB, H, and 2B, which are statistics related to scoring and batting, along with 3B, HR, BB, SO, and SB, which cover hitting, walks, strikeouts, and stolen bases. RA measures the number of runs allowed. ER, ERA, CG, SHO, and SV are statistics related to runs allowed, pitching performance, and game completion; they indicate how well a pitcher can prevent the opposing team from scoring. The last feature, E, counts the errors committed by the fielders that allow the offense to gain an advantage. The output is a numerical value: the predicted number of wins (W) for a team in 2015.
For more details on baseball statistics, you can visit this link: https://en.wikipedia.org/wiki/Baseball_statistics
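As a minimal sketch of the layout described above, the 16 input statistics and the target column can be separated like this (the column names follow the description; the single row of zeros is a placeholder, not real 2014 statistics):

```python
import pandas as pd

# The 17 columns described above: 16 season-statistic inputs plus the target W (wins).
feature_cols = ['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO',
                'SB', 'RA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'E']
target_col = 'W'

def split_features_target(df):
    """Separate the 16 input features from the wins column."""
    return df[feature_cols], df[target_col]

# Placeholder row of zeros, only to illustrate the shapes involved.
demo = pd.DataFrame([[0] * 17], columns=feature_cols + [target_col])
X, y = split_features_target(demo)
print(X.shape, y.shape)  # one team, 16 features, one target value
```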
```python
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import zscore
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('baseball.csv')

df.head()
```

The first five rows show the 17 columns: input features related to offense, defense, and pitching, plus the output column W. The input features are statistics related to scoring, batting, stealing bases, and pitching performance; the output is the number of wins to be predicted.
```python
df

df.shape
```

The data has 30 rows and 17 columns.
```python
df.columns
```

This lists the names of all the features.
```python
df.info()
```

The data has 30 records, 17 columns, and no missing values. The columns have different data types: one is float and 16 are integer.
```python
df.isnull().sum()  # checking for any null values
```

The data is complete and has no missing values.
```python
df.describe()
```

The data is complete and numerical, with features on different scales; some features are not normally distributed.
```python
df.describe().T

# The output we want to predict is W.
df.W.unique()

# The goal is to use regression to estimate the number of wins for a team.

# Let's start by examining the output variable.
sns.distplot(df.W)

# The output variable looks close to a normal distribution.
sns.histplot(df.W, bins=30)

# The number of wins is influenced by different features, so we have to explore the correlations.
# Let's begin by looking at the distribution of all the features.
for i in df.columns:
    sns.distplot(df[i])
    plt.show()
```

The features have different distributions:
R - right-skewed, with two modes
AB - close to a normal distribution
H - has two peaks
2B - slightly left-skewed
3B - approximately normal
HR - some skewness on both sides
BB - some skewness on both sides
SO - skewed
SB - right-skewed
RA - skewed
ER - slightly skewed
ERA - skewed on both sides
CG - not normally distributed
SHO - not normally distributed
SV - right-skewed
E - has two peaks

```python
from scipy.stats import skew

sns.pairplot(df, height=1, kind='kde')
```

The chart below shows how W (wins) is correlated with the other features.
```python
plt.figure(figsize=(20,7))
sns.heatmap(df.corr(), annot=True)
```

Observation:
The heatmap shows that RA, ER, and ERA are highly correlated with each other and have a strong negative correlation with W. We will remove RA and ER and keep ERA for the prediction model.
```python
df.drop(['RA', 'ER'], axis=1, inplace=True)

df.shape
```

We are left with 15 columns (14 features plus W) after removing the two columns from the data.
```python
df.isnull().sum()  # checking null values

sns.heatmap(df.isnull())
```

The data has no missing values.
```python
df.corr()

plt.figure(figsize=(20,7))
sns.heatmap(df.corr(), annot=True)
```

ERA is the only remaining feature with a negative correlation with wins. Overall, the correlations with wins are low; SV, SHO, BB, 2B, and R are moderately correlated with wins. ERA is also correlated with SV and SHO.
```python
plt.figure(figsize=(10,5))
plt.scatter(df['W'], df['ERA'], c='Red')
plt.xlabel('wins')
plt.ylabel('Earned run average')
plt.show()
```

The data shows that more wins are associated with a lower earned run average: teams with 65 to 88 wins have an ERA of roughly 4 to 5, while teams with 90 to 100 wins have an ERA of roughly 3 to 4. This may indicate that playing aggressively produces more runs but also carries more risk.
```python
plt.figure(figsize=(10,5))
plt.scatter(df['W'], df['SV'], c='Red')
plt.xlabel('wins')
plt.ylabel('Saves')
plt.show()
```

We can see here that saves have an impact on wins: more saves increase the chances of winning.
```python
plt.figure(figsize=(10,5))
plt.scatter(df['W'], df['SHO'], c='Red')
plt.xlabel('wins')
plt.ylabel('shutouts')
plt.show()
```

Shutouts have a slight effect: in some cases, more shutouts are associated with more wins.
```python
plt.figure(figsize=(10,5))
plt.scatter(df['W'], df['BB'], c='Green')
plt.xlabel('wins')
plt.ylabel('Walks')
plt.show()
```

We can also observe that more walks are related to more wins.
```python
plt.figure(figsize=(10,5))
plt.scatter(df['W'], df['2B'], c='Green')
plt.xlabel('wins')
plt.ylabel('Doubles')
plt.show()
```

Doubles also contribute to wins: more doubles are associated with more wins.
```python
plt.figure(figsize=(10,5))
plt.scatter(df['W'], df['R'], c='Green')
plt.xlabel('wins')
plt.ylabel('Runs')
plt.show()
```

Most teams scored between 600 and 800 runs, so the data does not show much variation, though there are some outliers as well.
```python
plt.figure(figsize=(18,10))
for i in enumerate(df):
    plt.subplot(3,5,i[0]+1)
    sns.boxplot(df[i[1]])
```

Only 5 columns have outliers: R, ERA, SHO, SV, and E.
```python
from scipy import stats

# z-score method
z = np.abs(zscore(df))
print(np.where(z > 3))

df_1 = df[(z < 3).all(axis=1)]
print("With Outliers::", df.shape)
print("After Removing Outliers::", df_1.shape)
```

Only one row was eliminated by the Z-score method.
```python
# Calculate the z-scores for each column
df_zscore = df.apply(zscore)

# Access the z-score of a specific element
z = np.array(df_zscore)
z_score = z[5, 1]
print("Z-score of element (5, 1):", z_score)
```

Here we use `apply` together with `zscore` to calculate the z-scores for each column of the DataFrame, convert the resulting DataFrame `df_zscore` into a NumPy array `z`, and then access the z-score of a specific element with the indexing syntax `z[5, 1]`.
```python
# Print the z-scores
print(df_zscore)

# IQR method: compute the per-column IQR (stats.iqr(df) would give a single
# scalar over the whole frame, which is not what we want here)
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
print(df_out.shape)
```

The IQR method does not detect any outliers here, which is plausible given how small the data is. We proceed with the Z-score result, which removes only a single row, since we have very little data to spare.
```python
df = df_1

df.shape

x = df.drop(['W'], axis=1)
y = df['W']

plt.figure(figsize=(25,20))
for i in enumerate(x.columns):
    plt.subplot(5,3,i[0]+1)
    sns.distplot(df[i[1]], color='g')

x.skew()
```

Some columns (H, CG, SHO, SV, E) have skewness beyond +/-0.5. We will reduce the skewness of these columns to make them more symmetric.
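The +/-0.5 rule can also be applied programmatically rather than by eye; a small sketch on synthetic data (the column names here are illustrative, not from baseball.csv):

```python
import numpy as np
import pandas as pd

# Select the columns whose absolute skewness exceeds 0.5: an exponential sample
# is strongly right-skewed, while a uniform sample is roughly symmetric.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'skewed': rng.exponential(size=500),
                     'symmetric': rng.uniform(size=500)})

skewed_cols = demo.skew()[demo.skew().abs() > 0.5].index.tolist()
print(skewed_cols)
```

The resulting list is exactly what would be passed on to the transform in the next step.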
```python
from sklearn.preprocessing import power_transform
x[['H','CG','SHO','SV','E']] = power_transform(x[['H','CG','SHO','SV','E']], method='yeo-johnson')

x.skew()
```

The skewness has been almost eliminated from every column.
```python
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
x = sc.fit_transform(x)
```

Note that MinMaxScaler maps each column's minimum to 0, so the scaled data contains many zero values.
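The zeros are expected behavior, not a scaling failure; a tiny sketch on made-up numbers shows why:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler rescales each column to [0, 1]: the column minimum becomes 0 and
# the maximum becomes 1, so every column of the scaled output contains a zero.
data = np.array([[10.0, 200.0],
                 [15.0, 250.0],
                 [20.0, 300.0]])
scaled = MinMaxScaler().fit_transform(data)
print(scaled)  # each column runs from 0.0 through 0.5 to 1.0
```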
```python
pd.DataFrame(x).isnull().sum()

pd.DataFrame(x).describe()
```

We can see that the data has been scaled.
```python
# Creating training and testing subsets from the data
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# We look for the optimal random state in the following cell
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
for i in range(0, 100):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.22, random_state=i)
    LR.fit(x_train, y_train)
    LR_predict_train = LR.predict(x_train)
    LR_predict_test = LR.predict(x_test)
    print(f'At random state {i}, the training accuracy is: {r2_score(y_train, LR_predict_train)}')
    print(f'At random state {i}, the test accuracy is: {r2_score(y_test, LR_predict_test)}')
    print('\n')
```

Since random_state=99 yields the best accuracy, we select it as the random state.
```python
# We divide the data into a 78% train set and a 22% test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.22, random_state=99)

x_train.shape
y_train.shape
x_test.shape
y_test.shape

from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(x_train, y_train)
print(LR.score(x_train, y_train))
LR_predict = LR.predict(x_test)

print('MSE:', mean_squared_error(y_test, LR_predict))
print('MAE:', mean_absolute_error(y_test, LR_predict))
print('r2_score:', r2_score(y_test, LR_predict))

from sklearn.linear_model import Ridge
R = Ridge()
R.fit(x_train, y_train)
print(R.score(x_train, y_train))
R_predict = R.predict(x_test)

print('MSE:', mean_squared_error(y_test, R_predict))
print('MAE:', mean_absolute_error(y_test, R_predict))
print('r2_score:', r2_score(y_test, R_predict))

from sklearn.svm import SVR
svr = SVR(kernel='linear')
svr.fit(x_train, y_train)
print(svr.score(x_train, y_train))
svr_predict = svr.predict(x_test)

print('MSE:', mean_squared_error(y_test, svr_predict))
print('MAE:', mean_absolute_error(y_test, svr_predict))
print('r2_score:', r2_score(y_test, svr_predict))

svr_p = SVR(kernel='poly')
svr_p.fit(x_train, y_train)
print(svr_p.score(x_train, y_train))
svrpred_p = svr_p.predict(x_test)

print('MSE:', mean_squared_error(y_test, svrpred_p))
print('MAE:', mean_absolute_error(y_test, svrpred_p))
print('r2_score:', r2_score(y_test, svrpred_p))

svr_r = SVR(kernel='rbf')
svr_r.fit(x_train, y_train)
print(svr_r.score(x_train, y_train))
svrpred_r = svr_r.predict(x_test)

print('MSE:', mean_squared_error(y_test, svrpred_r))
print('MAE:', mean_absolute_error(y_test, svrpred_r))
print('r2_score:', r2_score(y_test, svrpred_r))

from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor()
RF.fit(x_train, y_train)
print(RF.score(x_train, y_train))
RF_PRED = RF.predict(x_test)

print('MSE:', mean_squared_error(y_test, RF_PRED))
print('MAE:', mean_absolute_error(y_test, RF_PRED))
print('r2_score:', r2_score(y_test, RF_PRED))

from sklearn.tree import DecisionTreeRegressor
DTR = DecisionTreeRegressor()
DTR.fit(x_train, y_train)
print(DTR.score(x_train, y_train))
DTR_PRED = DTR.predict(x_test)

print('MSE:', mean_squared_error(y_test, DTR_PRED))
print('MAE:', mean_absolute_error(y_test, DTR_PRED))
print('r2_score:', r2_score(y_test, DTR_PRED))

from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor()
GBR.fit(x_train, y_train)
print(GBR.score(x_train, y_train))
GBR_PRED = GBR.predict(x_test)

print('MSE:', mean_squared_error(y_test, GBR_PRED))
print('MAE:', mean_absolute_error(y_test, GBR_PRED))
print('r2_score:', r2_score(y_test, GBR_PRED))

from sklearn.model_selection import cross_val_score
np.random.seed(10)

def rmse_cv(model, x, y):
    # neg_mean_squared_error returns negative MSE; negate it and take the square root
    rmse = np.sqrt(-cross_val_score(model, x, y, scoring='neg_mean_squared_error', cv=10))
    return rmse

models = [LinearRegression(), Ridge(), SVR(kernel='linear'), SVR(kernel='poly'), SVR(kernel='rbf'),
          RandomForestRegressor(), DecisionTreeRegressor(), GradientBoostingRegressor()]
names = ['LR', 'R', 'svr', 'svr_p', 'svr_r', 'RF', 'DTR', 'GBR']
for model, name in zip(models, names):
    score = rmse_cv(model, x, y)
    print("{} : {:.6f}, {:4f}".format(name, score.mean(), score.std()))
```

Based on all the metric scores, we choose LinearRegression as the final model.
```python
from sklearn.model_selection import GridSearchCV

param = {
    'fit_intercept': [True, False],
    'copy_X': [True],
    'n_jobs': [-1],
    'positive': [True],
}

# 'accuracy' is a classification metric; use 'r2' for a regression search
LR_grid = GridSearchCV(LinearRegression(), param, cv=4, scoring='r2', n_jobs=-1, verbose=2)

LR_grid.fit(x_train, y_train)
LR_grid_Pred = LR_grid.best_estimator_.predict(x_test)

print('MSE:', mean_squared_error(y_test, LR_grid_Pred))
print('MAE:', mean_absolute_error(y_test, LR_grid_Pred))
print('r2_score:', r2_score(y_test, LR_grid_Pred))

LR_grid_Pred

sns.distplot(LR_grid_Pred - y_test)

plt.scatter(LR_grid_Pred, y_test)
plt.plot(y_test, y_test, linewidth=4, color='Red')
```

We select LinearRegression as the best model after tuning with GridSearchCV.
```python
import joblib
joblib.dump(LR_grid.best_estimator_, 'Baseball Case Study_Project.obj')

# Reload the full dataset for a second pass, since RA and ER were dropped above.
df = pd.read_csv('baseball.csv')
df.columns

df = df[['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'RA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'E', 'W']]

plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), annot=True, linecolor='white', linewidths=.25)
```

Correlation analysis, independent features vs the dependent variable:
R, HR, 2B, BB, SHO, SV have a strong positive correlation with the target variable (W).
AB, H, 3B, SO, SB, CG, E have a weak correlation with the target variable (both positive and negative).
RA, ER, ERA have a strong negative correlation with the target variable and with each other; these features can bias the result, so we need to decide whether to drop any of them.
AB and H are highly correlated with each other at 74%.
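A compact companion to reading the heatmap's W row is to rank every feature by its correlation with the target. The sketch below uses synthetic data; only the column names mirror the analysis above:

```python
import numpy as np
import pandas as pd

# Build a toy frame where 'R' tracks the target, 'RA' opposes it, and 'SB' is noise,
# then rank the features by their Pearson correlation with 'W'.
rng = np.random.default_rng(1)
w = rng.normal(size=100)
demo = pd.DataFrame({
    'W': w,
    'R': w + rng.normal(scale=0.3, size=100),    # strong positive correlate
    'RA': -w + rng.normal(scale=0.3, size=100),  # strong negative correlate
    'SB': rng.normal(size=100),                  # unrelated noise
})

ranking = demo.corr()['W'].drop('W').sort_values(ascending=False)
print(ranking)
```

On the real data, `df.corr()['W'].sort_values()` would give the same one-line summary of which features to keep an eye on.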
```python
df.columns

# Relationship between runs scored and wins
sns.scatterplot(x='R', y='W', data=df)

# sns.relplot(x='R', y='W', data=df)
```

There is a positive relationship between runs scored and wins: the more runs, the higher the chances of winning.
```python
# Relationship between AB and wins
sns.scatterplot(x='AB', y='W', data=df)
```

The data is scattered and does not show any clear trend; AB has a weak correlation with wins and is not useful for predicting them.
```python
# Relationship between H and wins
sns.scatterplot(x='H', y='W', data=df)
# Scattered with no clear trend; H has a very low correlation with W (0.038),
# so we can drop this feature — it will not help predict wins.

# Relationship between 2B and wins
sns.scatterplot(x='2B', y='W', data=df)
# Positive relationship: the more doubles, the higher the chances of winning.

# Relationship between 3B and wins
sns.scatterplot(x='3B', y='W', data=df)
# Weak correlation; will not help predict W.

# Relationship between HR and wins
sns.scatterplot(x='HR', y='W', data=df)
# Positive relationship.

df.columns

# Relationship between BB and wins
sns.scatterplot(x='BB', y='W', data=df)
# Positive relationship between BB and W.

# Relationship between SO and wins
sns.scatterplot(x='SO', y='W', data=df)
# Weak positive relationship.

# Relationship between SB and wins
sns.scatterplot(x='SB', y='W', data=df)
# Weak negative relationship.

# Relationship between RA and wins
sns.scatterplot(x='RA', y='W', data=df)
# Strong negative relationship.

# Relationship between ER and wins
sns.scatterplot(x='ER', y='W', data=df)
# Strongly inversely related.

# Relationship between ERA and wins
sns.scatterplot(x='ERA', y='W', data=df)
# Strongly inversely related.

# Relationship between CG and wins
sns.scatterplot(x='CG', y='W', data=df)
# Weakly associated.

# Relationship between SHO and wins
sns.scatterplot(x='SHO', y='W', data=df)
# Directly related.

# Relationship between SV and wins
sns.scatterplot(x='SV', y='W', data=df)
# Directly related.

# Relationship between E and wins
sns.scatterplot(x='E', y='W', data=df)
# Weakly associated.

# Summary: some features have a weak association with the target variable,
# while others have a strong inverse relationship.
```

In addition, there is multicollinearity: some independent features are strongly related to each other.
```python
df.columns

df.head(3)
```

Input features: Runs, At Bats, Hits, Doubles, Triples, Home Runs, Walks, Strikeouts, Stolen Bases, Runs Allowed, Earned Runs, Earned Run Average (ERA), Complete Games, Shutouts, Saves, and Errors.

```python
# R   – Runs scored: how many times a player reaches home plate
# AB  – At bat: times a player faces the pitcher, excluding walks, hit by pitch, sacrifices, interference, or obstruction
# H   – Hit: reaching base by hitting a fair ball without an error by the defense
# 2B  – Double: a hit that allows the batter to safely reach second base without an error by the defense
# 3B  – Triple: a hit that allows the batter to safely reach third base without an error by the defense
# HR  – Home run: a hit that lets the batter touch all four bases without an error by the defense
# BB  – Base on balls (a "walk"): reaching first base by not swinging at four pitches outside the strike zone
# SB  – Stolen base: advancing one or more bases while the ball is held by the defense
# RA  – Runs allowed: runs scored by opponents
# ER  – Earned runs: runs allowed that are not due to errors
# ERA – Earned run average: average number of earned runs allowed per nine innings
# CG  – Complete games: games where the pitcher pitches the entire game
# SHO – Shutouts: games where the pitcher does not allow any runs
# SV  – Saves: games where the pitcher preserves a lead in the final inning
# E   – Errors: mistakes by the defense that allow a batter or runner to advance one or more bases

sns.scatterplot(x='R', y='W', hue='AB', data=df)  # At bats, AB

df.head(50)

for i in df.columns:
    print("unique values of feature", i, '=', df[i].nunique())

df.CG.unique()

sns.scatterplot(x='R', y='W', size='CG', data=df)

sns.scatterplot(x='CG', y='W', data=df)

# Complete games are not strongly related to wins, since a pitcher can pitch a
# complete game and still lose. A complete game is when a pitcher stays on the
# mound for his team for the whole game, however many innings it lasts.

sns.scatterplot(x='ER', y='RA', hue='ERA', data=df)

# RA  – runs scored by opponents
# ER  – runs allowed that are not due to errors
# ERA – average number of earned runs allowed per nine innings
# Earned runs are the main factor in ERA, the most common measure of a pitcher's
# performance. When there are no errors or passed balls in an inning or a game,
# all the runs in that inning or game are earned runs.
# RA and ER measure nearly the same thing, which explains their strong relationship.
```

The relationship is strong; let's investigate it with variance inflation factors.

```python
v = df.drop('W', axis=1)
v.head(2)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled = sc.fit_transform(v)

from statsmodels.stats.outliers_influence import variance_inflation_factor

VIF = pd.DataFrame()
VIF['features'] = v.columns
VIF['vif'] = [variance_inflation_factor(scaled, i) for i in range(len(v.columns))]
VIF

# Several features have a VIF higher than 5.
# Removing the feature 'ER'
v = df.drop(['W', 'ER'], axis=1)
v.head(2)
scaled = sc.fit_transform(v)

VIF2 = pd.DataFrame()
VIF2['features'] = v.columns
VIF2['vif'] = [variance_inflation_factor(scaled, i) for i in range(len(v.columns))]
VIF2

# Dropping the RA column
v = df.drop(['W', 'ER', 'RA'], axis=1)
scaled = sc.fit_transform(v)

VIF3 = pd.DataFrame()
VIF3['features'] = v.columns
VIF3['vif'] = [variance_inflation_factor(scaled, i) for i in range(len(v.columns))]
VIF3

v.head(3)

plt.figure(figsize=(12,10))
sns.heatmap(v.corr(), annot=True, linecolor='black', linewidths=.25)

# All the strongly correlated feature pairs are now gone.

v.skew()

for i in v.columns:
    sns.boxplot(v[i])
    plt.show()

v.columns

# Outliers are present in some of the more prominent features.
# Let's handle R, ERA, SHO, SV, and E by capping values at the upper IQR fence.

IQR = df['R'].quantile(.75) - df['R'].quantile(.25)
upper = df['R'].quantile(.75) + (1.5 * IQR)
v['R'] = np.where(v['R'] > upper, upper, v['R'])

IQR = df['ERA'].quantile(.75) - df['ERA'].quantile(.25)
upper = df['ERA'].quantile(.75) + (1.5 * IQR)
v['ERA'] = np.where(v['ERA'] > upper, upper, v['ERA'])

IQR = df['SHO'].quantile(.75) - df['SHO'].quantile(.25)
upper = df['SHO'].quantile(.75) + (1.5 * IQR)
v['SHO'] = np.where(v['SHO'] > upper, upper, v['SHO'])

IQR = df['SV'].quantile(.75) - df['SV'].quantile(.25)
upper = df['SV'].quantile(.75) + (1.5 * IQR)
v['SV'] = np.where(v['SV'] > upper, upper, v['SV'])

IQR = df['E'].quantile(.75) - df['E'].quantile(.25)
upper = df['E'].quantile(.75) + (1.5 * IQR)
v['E'] = np.where(v['E'] > upper, upper, v['E'])

for i in v.columns:
    sns.boxplot(v[i])
    plt.show()

# Anomalies handled.

v.skew()
# After dealing with the outliers, most of the skewness has also been reduced.

from sklearn.preprocessing import power_transform
transformed = power_transform(v)

transformed = pd.DataFrame(transformed)
transformed.columns = v.columns

transformed.skew()
transformed.head(2)
# Skewness is no longer present.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled = sc.fit_transform(transformed)

# Separate the independent and dependent variables.
X = scaled
Y = df['W']

X.shape
Y.shape

# Since the output variable (wins) is continuous, this is a regression problem.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
LR = LinearRegression()

# Finding the best random_state for the train-test split
for i in range(0, 200):
    x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=i, test_size=.2)
    LR.fit(x_train, y_train)
    train_pred = LR.predict(x_train)
    test_pred = LR.predict(x_test)
    if round(r2_score(y_test, test_pred), 2) == round(r2_score(y_train, train_pred), 2):
        print("At random state", i, "the model performs very well")
        print("At random state:", i)
        print("Test R2 score is:", round(r2_score(y_test, test_pred), 2))
        print('Train R2 score is:', round(r2_score(y_train, train_pred), 2))
        print('X'*50, '\n')

# Selecting random_state=175
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=175, test_size=.2)
LR.fit(x_train, y_train)

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

LR_model = LinearRegression()
RD_model = Ridge()
LS_model = Lasso()
DT_model = DecisionTreeRegressor()
SV_model = SVR()
KNR_model = KNeighborsRegressor()
RFR_model = RandomForestRegressor()
XGB_model = XGBRegressor()
Elastic_model = ElasticNet()
SGH_model = SGDRegressor()
Bag_model = BaggingRegressor()
ADA_model = AdaBoostRegressor()
GB_model = GradientBoostingRegressor()
model = [LR_model, RD_model, LS_model, DT_model, SV_model, KNR_model, RFR_model,
         XGB_model, Elastic_model, SGH_model, Bag_model, ADA_model, GB_model]

for m in model:
    m.fit(x_train, y_train)
    print('mean_absolute_error of', m, 'model', mean_absolute_error(y_test, m.predict(x_test)))
    print('mean_squared_error of', m, 'model', mean_squared_error(y_test, m.predict(x_test)))
    print('R2 score of', m, 'model', r2_score(y_test, m.predict(x_test)) * 100)
    print('X' * 50, '\n\n')

from sklearn.model_selection import cross_val_score

for i in model:
    print('mean_square of', i, 'model', mean_squared_error(y_test, i.predict(x_test)))
    print("cross validation score of", i, "is", cross_val_score(i, X, Y, cv=10, scoring='neg_mean_squared_error').mean())
    print('*'*50)

for i in model:
    print('root mean_square of', i, 'model', np.sqrt(mean_squared_error(y_test, i.predict(x_test))))
    score = cross_val_score(i, X, Y, cv=10, scoring='neg_mean_squared_error').mean()
    print("cross validation root mean square of", i, "is", np.sqrt(-score))
    print('*'*50)

from sklearn.model_selection import GridSearchCV

params = {"learning_rate": [0.01, .05, .1, .2, .3, .5],
          "max_depth": [3, 4, 5, 6, 8],
          "min_child_weight": [1, 3, 5, 7],
          "gamma": [0.01, 0.05, 0.1, 0.2, 0.3],
          "colsample_bytree": [0.3, 0.4, 0.5, 0.7]}

GCV = GridSearchCV(XGB_model, params, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
GCV.fit(x_train, y_train)

GCV.best_params_

GCV_pred = GCV.best_estimator_.predict(x_test)
mean_squared_error(y_test, GCV_pred)

# The error has dropped from 81.98 to 76.

import joblib
joblib.dump(GCV.best_estimator_, "Baseball.pkl")
```
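Once the tuned estimator is saved, it can be reloaded later for predictions. A self-contained round-trip sketch of the joblib dump/load pattern (a toy LinearRegression on synthetic data stands in for the tuned XGBoost model saved above):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a toy model on an exact linear relationship y = 2x + 1, persist it with
# joblib, then reload it and predict — the same pattern used with "Baseball.pkl".
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
est = LinearRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(est, path)

loaded = joblib.load(path)
print(loaded.predict([[4.0]]))  # ~[9.0], matching 2*4 + 1
```

In production the reloaded estimator must receive rows preprocessed exactly as in training (same dropped columns, same power transform, same scaler), so those fitted transformers are worth persisting alongside the model.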